Hierarchical vs. flat n-gram-based text categorization: Can we do better?

نویسندگان

  • Jelena Graovac
  • Jovana J. Kovacevic
  • Gordana Pavlovic-Lazetic
چکیده

Hierarchical text categorization (HTC) refers to assigning a text document to one or more most suitable categories from a hierarchical category space. In this paper we present two HTC techniques based on kNN and SVM machine learning techniques for categorization process and byte n-gram based document representation. They are fully language independent and do not require any text preprocessing steps, or any prior information about document content or language. The effectiveness of the presented techniques and their language independence are demonstrated in experiments performed on five tree-structured benchmark category hierarchies that differ in many aspects: Reuters-Hier1, Reuters-Hier2, 15NGHier and 20NGHier in English and TanCorpHier in Chinese. The results obtained are compared with the corresponding flat categorization techniques applied to leaf level categories of the considered hierarchies. While kNN-based flat text categorization produced slightly better results than kNN-based HTC on the largest TanCorpHier and 20NGHier datasets, SVM-based HTC results do not considerably differ from the corresponding flat techniques, due to shallow hierarchies; still, they outperform both kNN-based flat and hierarchical categorization on all corpora except the smallest Reuters-Hier1 and Reuters-Hier2 datasets. Formal evaluation confirmed that the proposed techniques obtained state-of-the-art results.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Hierarchical Bayesian Clustering for Automatic Text Classification

Text classification, the grouping of texts into several clusters, has been used as a means of improving both the efficiency and the effectiveDess of text retrieval/categorization In this paper we propose a hierarchical clustering algor i thm that constructs a Bet of clusters having the maximum Bayesian posterior probability, the probability that the given texts are classified into clusters We c...

متن کامل

Large-scale Structural Reranking for Hierarchical Text Categorization

Current hierarchical text categorization (HTC) methods mainly fall into three directions: (1) Flat one-vs.-all approach, which flattens the hierarchy into independent nodes and trains a binary one-vs.-all classifier for each node. (2) Top-down method, which uses the hierarchical structure to decompose the entire problem into a set of smaller subproblems, and deals with such sub-problems in top-...

متن کامل

Using Boolean Rule Extraction for Taxonomic Text Categorization for Big Data

Categorization hierarchies are ubiquitous in big data. Examples include MEDLINE’s Medical Subject Headings (MeSH) taxonomy, United Nations Standard Products and Services Code (UNSPSC) product codes, and the Medical Dictionary for Regulatory Activities (MedDRA) hierarchy for adverse reaction coding. A key issue is that in most taxonomies the probability of any particular example being in a categ...

متن کامل

Improving the Operation of Text Categorization Systems with Selecting Proper Features Based on PSO-LA

With the explosive growth in amount of information, it is highly required to utilize tools and methods in order to search, filter and manage resources. One of the major problems in text classification relates to the high dimensional feature spaces. Therefore, the main goal of text classification is to reduce the dimensionality of features space. There are many feature selection methods. However...

متن کامل

Text Categorization Techniques for Intrusion Detection -- A N-Gram-Based Method

Text categorization techniques have been used in anomaly intrusion detection by Liao and Vermuri in USENIX 02 paper. [1] Another n-gram-based text categorization method proposed in this report is expected to improve the performance of intrusion detection system that implements Liao’s method.

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:
  • Comput. Sci. Inf. Syst.

دوره 14  شماره 

صفحات  -

تاریخ انتشار 2017